Introduction

In the first assignment, we utilized the RNASeq data from Sanghi et al.’s publication “Chromatin Accessibility Associates with Protein-RNA Correlation in Human Cancer” to perform data cleaning, normalization, gene mapping to HUGO symbols, and preliminary analysis. The goal of the original study was to explore the relationship between chromatin structure alterations and molecular phenotypes in cancer by utilizing multi-omics profiling of human tumors. They applied this approach to a total of 36 individuals to obtain thyroid cancer primary tumors, metastases, and patient-matched normal tissue, with a total of 87 samples. The study identified a local chromatin structure that is highly correlated with coordinated RNA and protein expression, particularly within gene-body enhancers, and claimed that local enhancers may be more important for regulating cancer gene expression than distal enhancers. Moreover, the authors found that TFs in the MAPK pathway are actively bound significantly more in tumor and metastases than in normal tissue.

We obtained the dataset with the ID GSE162515 from GEO, which is linked to the study “Chromatin Accessibility Associates with Protein-RNA Correlation in Human Cancer”. The dataset contains a total of 28,883 genes, and the experiment conditions are categorized into 27 normal tissue samples, 30 tumor tissue samples, and 30 metastases tissue samples.

To improve the data quality, we removed genes with less than 1 count per million (cpm) in less than three samples, resulting in the removal of 11,538 genes. We then performed normalization using TMM with the edgeR package to correct the large deviation of means between the untreated and treated groups while still preserving some of the original sample distribution. Interestingly, the data was already well-aligned after removing low counts, so the normalization step only slightly improved the dataset’s quality.

From the post-normalization MDS plot, we observed that the overall separation between each test condition group (T and M) and the normal group is clear.
Figure 1. MDS plot to inspect the sample separation. From the plot, we can see that the overall separation between each test condition group (T and M) to normal group is good, indicating a good dataset quality.

Figure 1. MDS plot to inspect the sample separation. From the plot, we can see that the overall separation between each test condition group (T and M) to normal group is good, indicating a good dataset quality.

Additionally, the variance of the data was relatively consistent with the expected trend, as indicated by the dispersion-squared BCV plot.

Figure 2. Dispersion-squared BCV plot. Genes with low counts have a higher variation, whereas genes with higher counts have a lower variation towards the expected trend. For our dataset, the trend of the count data falls around the common dispersion line in a BCV plot, indicating that the variance of the data is relatively consistent with the expected trend.

Figure 2. Dispersion-squared BCV plot. Genes with low counts have a higher variation, whereas genes with higher counts have a lower variation towards the expected trend. For our dataset, the trend of the count data falls around the common dispersion line in a BCV plot, indicating that the variance of the data is relatively consistent with the expected trend.

To map identifiers, we utilized the package biomaRt with Ensembl data. Ensembl gene IDs in the original dataset were mapped to the corresponding HUGO gene symbols. After removing low counts, genes with duplicate identifiers, and genes that cannot be mapped to HUGO symbols, the final dataset contained 17,368 unique genes.

Download Packages

In this section, we import and install the necessary packages for this assignment, in which we will conduct a differential expression analysis using the normalized dataset and a thresholded over-representation analysis.

if (!requireNamespace("BiocManager", quietly = TRUE)){
  install.packages("BiocManager")}

if (!requireNamespace("GEOmetadb", quietly = TRUE)){
  BiocManager::install("GEOmetadb")}

if (!requireNamespace("circlize", quietly = TRUE))
    install.packages("circlize")

if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
    BiocManager::install("ComplexHeatmap")

if (!requireNamespace("gprofiler2", quietly = TRUE))
    BiocManager::install("gprofiler2")

if (!requireNamespace("ggplot2", quietly = TRUE)){
  install.packages("ggplot2")}

if (!requireNamespace("edgeR", quietly = TRUE)){
  BiocManager::install("edgeR")}

if (!requireNamespace("biomaRt", quietly = TRUE)){
  BiocManager::install("biomaRt")}

if (!requireNamespace("knitr", quietly = TRUE)){
  install.packages("knitr")}

if (!requireNamespace("GEOquery", quietly = TRUE)){
  BiocManager::install("GEOquery")}

if (!requireNamespace("Biobase", quietly = TRUE)){
  BiocManager::install("Biobase")}

if (!requireNamespace("dplyr", quietly = TRUE)){
  install.packages("dplyr")}

if (!requireNamespace("kableExtra", quietly = TRUE)){
  install.packages("kableExtra")}

Load packages

library("GEOmetadb")
library("ggplot2")
library("edgeR")
library("biomaRt")
library("ComplexHeatmap")
library("circlize")
library("dplyr")
library("GEOquery")
library("Biobase")
library("knitr")
library("kableExtra")

Differential Expression Analysis

In this section, we will use edgeR package to perform differential expression analysis.

Retrieve information from Assignment 1.

The normalized data processed from Assignment 1 has been stored in the file named as “normalized_counts_final.txt”. We now load this data into R. The categories for the samples were previously saved in “samples.txt” file. We can also load it into R.

normalized_counts <- read.table("normalized_counts_final.txt")
samples <- read.table("samples.txt")

Examine the normalized dataset

The format of the normalized count dataset has already been processed in a way that can be directly used to plot the heatmap, which would be our next step.

kable(normalized_counts[1:5,1:5], format = "html")
F001.C1.T F002.C1.N F003.C1.M F004.C2.T F005.C2.N
A1BG-AS1 9.941028 4.706365 11.072142 5.9720957 3.615496
A2M 639.332343 806.127267 728.618408 565.2189214 515.208115
A2M-AS1 0.379692 2.312610 1.428664 0.6846989 2.824606
A4GALT 6.282177 14.200239 16.310575 11.2214538 9.151723
AAAS 29.512426 30.713088 28.751854 34.8055261 24.178626

Examine the samples information

knitr::kable(head(samples, 10),  format = "html")
individual tissue_type
F001.C1.T C1 T
F002.C1.N C1 N
F003.C1.M C1 M
F004.C2.T C2 T
F005.C2.N C2 N
F007.C3.T C3 T
F009.C4.T C4 T
F010.C4.N C4 N
F014.C5.N C5 N
F017.C7.T C7 T

Design Model

In this section, we will create a data matrix from our dataset.

In order to perform statistical testing, we need a design matrix the defines our model. Notice that in our dataset, there are two factors:

  1. Tissue types (Normal, tumor, metastases)
  2. Patient (individual)

Hence, ideally, we would like to account for both factors in our design matrix.

Recall that we want to find the genes that are differentially expressed in the tumor and metastases samples contrast to normal tissues, so we first factor the tissue types such that “N” type comes first. This would affect on which tissue type would be chosen as the control value (i.e. intercept) when performing model.matrix function to generate the design matrix.

# Set normal tissue type as the intercept
samples$tissue_type <- factor(samples$tissue_type)
samples$tissue_type <- relevel(samples$tissue_type, ref="N")

# Doesn't really matter for individual people, but we can set C1 as the intercept for the sake of tidiness.
samples$individual <- factor(samples$individual)
samples$individual <- relevel(samples$individual, ref="C1")

# Generate design matrix
model_design <- model.matrix(~ samples$individual + samples$tissue_type)

model_design[1:5,1:5]
##   (Intercept) samples$individualC10 samples$individualC11 samples$individualC12
## 1           1                     0                     0                     0
## 2           1                     0                     0                     0
## 3           1                     0                     0                     0
## 4           1                     0                     0                     0
## 5           1                     0                     0                     0
##   samples$individualC13
## 1                     0
## 2                     0
## 3                     0
## 4                     0
## 5                     0

Analysis Using edgeR

For our downstream analysis, we are going to use edgeR, which is specifically designed for RNASeq data. First, we create the base edgeR object called DGEList. The group we want to define is the tissue type.

d <- edgeR::DGEList(counts = normalized_counts, group = samples$tissue_type)

Check Eligibility

One important underlying assumption for using the Quasi-likelihood model is that the data follows a negative binomial distribution. We need to verify that our dataset indeed meets that assumption.

To do this, we calculate the dispersion and plot to visualize the mean-variance relationship.

d <- edgeR::estimateDisp(d, model_design)
edgeR::plotMeanVar(d,
                   show.raw.vars = TRUE,
                   show.tagwise.vars = TRUE,
                   NBline = TRUE,
                   show.ave.raw.vars = TRUE,
                   show.binned.common.disp.vars = TRUE,
                   main = "Mean-Variance Plot for Normalized Data")
# display legend
legend("topleft", 
       legend=c("Raw Data", "Tagwise Dispersion", "Average Raw Variances", 
                "Binned Common Dispersion", "Negative Binomial Line"), 
       col = c("grey", "lightblue", "maroon", "red", "dodgerblue2"), pch=c(1,1,4,4,NA), lty=c(0,0,0,0,1), lwd=c(1,1,1,1,2), cex=0.6)
Figure 3. Mean-variance plot showing the distribution of data. The dispersion and variance of the data perfectly follows the negative binomial distribution, in which the raw data and the blue negative binomial line alligns.

Figure 3. Mean-variance plot showing the distribution of data. The dispersion and variance of the data perfectly follows the negative binomial distribution, in which the raw data and the blue negative binomial line alligns.


Now, we have created the design matrix and verified the assumption for the data to be negative-binomially distributed, we can proceed to the next stage of our analysis. We used the quasi-likelihood models since our dataset is from an RNASeq experiment and quasi-likelihood models are best suited to handle RNASeq data.

fit <- edgeR::glmQLFit(d, model_design)


Once we have fit the model, we can proceed to calculate differential expression. We will perform the calculation separately for Tumor vs. Normal, and Metastases vs. Normal. To ensure that the significantly differentially expressed genes are not obtained by random, we will perform correction for multiple hypothesis testing using Benjamini-Hochberg approach.

Tumor vs. Normal

In this section, we want to test for differential expression between the treatment group (tissue type T) and the reference group (tissue type N).

qlf_tn <- edgeR::glmQLFTest(fit, coef='samples$tissue_typeT')

qlf_tn_hits <- edgeR::topTags(qlf_tn,sort.by = "PValue", adjust.method = "BH",
                           n = nrow(normalized_counts))
knitr::kable(head(qlf_tn_hits$table), format = "html")
logFC logCPM F PValue FDR
CLDN16 5.384950 5.772560 122.90420 0 0
LHX2 3.473253 1.701027 130.64741 0 0
HS6ST2 4.133289 4.429133 110.57323 0 0
PRR15 5.114034 5.685377 107.41459 0 0
SLIT1 4.810897 5.765108 100.52563 0 0
ABCC11 3.961639 2.420472 99.50698 0 0

We can examine the number of genes pass the threshold and correction. We are using 0.05 as the threshold for p-value as it is commonly used in practice.
Number of genes that passed the threshold.

length(which(qlf_tn_hits$table$PValue < 0.05))
## [1] 7145


Number of genes that passed correction.

length(which(qlf_tn_hits$table$FDR < 0.05))
## [1] 5648


The threshold of 0.05 gives us quite a lot of genes. However, since we are dealing with human disease, which should be more strict to obtain more meaningful hits for further analysis, we choose to use a more stringent threshold of 0.01.
Number of genes that passed the threshold

length(which(qlf_tn_hits$table$PValue < 0.01))
## [1] 5156


Number of genes that passed correction

length(which(qlf_tn_hits$table$FDR < 0.01))
## [1] 3893


The number of genes returned are still sufficient for further analysis.

Metastases vs. Normal

In this section, we test for differential expression between the treatment group (tissue type M) and the reference group (tissue type N).

qlf_mn <- edgeR::glmQLFTest(fit, coef='samples$tissue_typeM')
qlf_mn_hits <- edgeR::topTags(qlf_mn,sort.by = "PValue", adjust.method = "BH",
                           n = nrow(normalized_counts))
knitr::kable(head(qlf_mn_hits$table), format = "html")
logFC logCPM F PValue FDR
CDH16 -6.791713 5.728732 195.1855 0 0
CWH43 -4.967247 3.916405 187.8185 0 0
CHGA -4.588762 1.320610 647.5490 0 0
CLCNKB -3.512404 3.970571 170.1807 0 0
CCDC146 -2.524562 4.409113 161.5505 0 0
APOA1 -4.051688 2.087150 160.0036 0 0

Number of genes that passed the 0.01 threshold

length(which(qlf_mn_hits$table$PValue < 0.01))
## [1] 6590


Number of genes that passed correction with 0.01 threshold

length(which(qlf_mn_hits$table$FDR < 0.01))
## [1] 5477


Each gene is represented by a point in the plot. The horizontal axis of the plot is the log2 fold change and the vertical axis is the -log10p which indicates how likely the differential of a gene is due to actual biological variation.

Volcano Plot (Tumor vs. Normal)

We plot the volcano plot to visualize differential expression of genes for tumor vs. normal tissue samples.

volcano_color_tn = rep('gray', times = nrow(qlf_tn_hits$table))

volcano_color_tn[qlf_tn_hits$table$logFC < 0 &
                        qlf_tn_hits$table$FDR < 0.01
                      & abs(qlf_tn_hits$table$logFC) > 2] <- 'blue'

volcano_color_tn[qlf_tn_hits$table$logFC > 0 &
                        qlf_tn_hits$table$FDR < 0.01 &
                        abs(qlf_tn_hits$table$logFC) > 2] <- 'red'

# Plot the volcano plot
plot(qlf_tn_hits$table$logFC,
     -log(qlf_tn_hits$table$PValue, base=10),
     col = volcano_color_tn,
     xlab = "log2 fold change",
     ylab = "-log10 p",
     main = "Volcano plot for Tumor vs. Normal Tissues"
)

# Add the legend
legend("topleft", legend=c("Downregulated in tumor tissues","Upregulated in tumor tissues", "Not significant"),
       fill = c("blue", "red", "grey"), cex = 0.5)

# Label genes with over 4 LFC
tn_sig <- which(qlf_tn_hits$table$logFC > 5 | qlf_tn_hits$table$logFC < -4)
text(x = qlf_tn_hits$table$logFC[tn_sig] , y = -log10(qlf_tn_hits$table$PValue[tn_sig]),
     label = rownames(qlf_tn_hits$table)[tn_sig], cex = 0.5, adj = c(1, NA), pos = 3)
\label{fig:volcano_tn}Figure 4. Volcano plot for the top differentially expressed genes that pass the 0.01 P-value threshold and has absolute LFC of larger than 2. There is extensive number of upregulated genes in tumor tissues compared to normal tissues, with the highest of exceeding 6 LFC (log2 Fold Change). Several genes are significantly downregulated, having more than 4 LFC. Upregulated genes with absolute log fold change over 5 and downregulated genes with absolute LFC over 4 have been labeled out the corresponding HUGO symbols.

Figure 4. Volcano plot for the top differentially expressed genes that pass the 0.01 P-value threshold and has absolute LFC of larger than 2. There is extensive number of upregulated genes in tumor tissues compared to normal tissues, with the highest of exceeding 6 LFC (log2 Fold Change). Several genes are significantly downregulated, having more than 4 LFC. Upregulated genes with absolute log fold change over 5 and downregulated genes with absolute LFC over 4 have been labeled out the corresponding HUGO symbols.


Volcano Plot (Metastases vs. Normal)

We plot the volcano plot to visualize differential expression of genes for metastases vs. normal tissue samples.

volcano_color_mn = rep('gray', times = nrow(qlf_mn_hits$table))

volcano_color_mn[qlf_mn_hits$table$logFC < 0 &
                        qlf_mn_hits$table$FDR < 0.01
                      & abs(qlf_mn_hits$table$logFC) > 2] <- 'blue'

volcano_color_mn[qlf_mn_hits$table$logFC > 0 &
                        qlf_mn_hits$table$FDR < 0.01 &
                        abs(qlf_mn_hits$table$logFC) > 2] <- 'red'

plot(qlf_mn_hits$table$logFC,
     -log(qlf_mn_hits$table$PValue, base=10),
     col = volcano_color_mn,
     xlab = "log2 fold change",
     ylab = "-log10 p",
     main = "Volcano plot for Metastases vs. Normal Tissues"
    )

legend("topright", legend=c("Downregulated in tumor tissues","Upregulated in tumor tissues", "Not significant"),fill = c("blue", "red", "grey"), cex = 0.5)

mn_sig <- which(qlf_mn_hits$table$logFC > 5 | qlf_mn_hits$table$logFC < -5)
text(x = qlf_mn_hits$table$logFC[mn_sig] , y = -log10(qlf_mn_hits$table$PValue[mn_sig]), label = rownames(qlf_mn_hits$table)[mn_sig], cex = 0.5, adj = c(1, NA), pos = 3)
\label{fig:volcano_mn}Figure 5. Volcano plot for the top differentially expressed genes that pass the 0.01 P-value threshold and has absolute LFC of larger than 2. Genes with absolute log fold change over 5 have been labeled out the corresponding HUGO symbols.

Figure 5. Volcano plot for the top differentially expressed genes that pass the 0.01 P-value threshold and has absolute LFC of larger than 2. Genes with absolute log fold change over 5 have been labeled out the corresponding HUGO symbols.


Heatmap (Tumor vs. Normal)

In order to find the most significantly differentially expressed, we pick the hits that pass a P-value threshold of 0.01, and has absolute value of LFC larger than 2 as the top hits.

# Get top hit genes for T vs N
top_hits_tn <- rownames(qlf_tn_hits)[qlf_tn_hits$table$PValue < 0.01 & abs(qlf_tn_hits$table$logFC) > 2]

Then, we calculate their log2 counts per million value. We add 1 to the value to eliminate mathematical error when the count is 0. This technique is commonly used in practise when calculating log values.

# Calculate logCPM
hm_matrix <- log2(normalized_counts + 1) # Plus 1 to eliminate error when the count is 0.

Obtain data for each category

# Obtain tumor and normal tissue samples
tumor <- grep("T$", colnames(hm_matrix), value=TRUE)
normal <- grep("N$", colnames(hm_matrix), value=TRUE)
metastases <- grep("M$", colnames(hm_matrix), value=TRUE)

Scale heat map matrix by rows.

# Scale heat map matrix by rows
hm_matrix_tn <- t(scale(t(hm_matrix[rownames(hm_matrix) %in% top_hits_tn, c(tumor, normal)])))

Pick the color to be used for the plot. If heatmap only contains non-negative values, we will use only white and red, where white indicates 0, and red indicates positive values.

Otherwise, we will use blue to represent negative values.

We plot the heatmap to visualize differential expression of genes for metastases vs. normal tissue samples.

# Choose colors to be used
if(min(hm_matrix_tn) == 0){
    heatmap_color = colorRamp2(c( 0, max(hm_matrix_tn)),
                             c( "white", "red"))
  } else {
    heatmap_color = colorRamp2(c(min(hm_matrix_tn), 0,
          max(hm_matrix_tn)), c("blue", "white", "red"))
  }

# Create the heatmap from the given matrix, showing the dendrogram for both genes and samples.
name = "Expression level"
hm_tn <- Heatmap(as.matrix(hm_matrix_tn),
      show_row_dend = TRUE, show_column_dend = TRUE,
      col = heatmap_color, show_column_names = TRUE,
      show_row_names = FALSE, show_heatmap_legend = TRUE,
      column_names_gp = grid::gpar(fontsize = 6),
      heatmap_legend_param = list(title = name))

draw(hm_tn,
   column_title="Tumor vs. Normal Tissue Heatmap for Top DE Genes",
   column_title_gp=grid::gpar(fontsize=12))
\label{fig:heatmap_tn}Figure 6. Heatmap for the top differentially expressed genes that pass the 0.01 P-value threshold. There are clear clusterings of genes that are upregulated in tumor tissues, while are downregulated in normal tissues, indicating confident evidence of differentially expressed genes.

Figure 6. Heatmap for the top differentially expressed genes that pass the 0.01 P-value threshold. There are clear clusterings of genes that are upregulated in tumor tissues, while are downregulated in normal tissues, indicating confident evidence of differentially expressed genes.

Heatmap (Metastases vs. Normal)

In order to find the most significantly differentially expressed, we pick the hits that pass a P-value threshold of 0.01, and has absolute value of LFC larger than 2 as the top hits.

# Get top hit genes for M vs N
top_hits_mn <- rownames(qlf_mn_hits)[qlf_mn_hits$table$PValue < 0.01 & abs(qlf_mn_hits$table$logFC) > 2]

Scale heat map matrix by rows.

# Scale heat map matrix by rows
hm_matrix_mn <- t(scale(t(hm_matrix[rownames(hm_matrix) %in% top_hits_mn, c(metastases, normal)])))

Pick the color to be used for the plot. If heatmap only contains non-negative values, we will use only white and red, where white indicates 0, and red indicates positive values.

Otherwise, we will use blue to represent negative values.

# Choose colors to be used
if(min(hm_matrix_mn) == 0){
    heatmap_color = colorRamp2(c( 0, max(hm_matrix_mn)),
                             c( "white", "red"))
  } else {
    heatmap_color = colorRamp2(c(min(hm_matrix_mn), 0,
          max(hm_matrix_mn)), c("blue", "white", "red"))
  }

# Create the heatmap from the given matrix, showing the dendrogram for both genes and samples.
hm_mn <- Heatmap(as.matrix(hm_matrix_mn),
      show_row_dend = TRUE, show_column_dend = TRUE,
      col = heatmap_color, show_column_names = TRUE,
      show_row_names = FALSE, show_heatmap_legend = TRUE,
      column_names_gp = grid::gpar(fontsize = 6),
      heatmap_legend_param = list(title = name))

draw(hm_mn,
   column_title="Metastases vs Normal Tissue Heatmap for Top DE Genes",
   column_title_gp=grid::gpar(fontsize=12))
\label{fig:heatmap_mn}Figure 7. Heatmap for the top differentially expressed genes that pass the 0.01 P-value threshold. There are clear clusterings of genes that are upregulated in metastases tissues, while are downregulated in normal tissues, indicating confident evidence of differentially expressed genes.

Figure 7. Heatmap for the top differentially expressed genes that pass the 0.01 P-value threshold. There are clear clusterings of genes that are upregulated in metastases tissues, while are downregulated in normal tissues, indicating confident evidence of differentially expressed genes.

Differential Expression Analysis Discussion

1. Calculate p-values for each of the genes in your expression set. How many genes were significantly differentially expressed? What thresholds did you use and why?

At first, I employed the commonly used p-value of 0.05, resulting in 7145 genes for the tumor vs. normal tissue analysis, and 8634 genes for the metastases vs. normal tissue analysis, before multiple hypothesis testing corrections. However, this was an extensive set of genes, prompting us to alter the p-value threshold to 0.01, and FDR threshold also to 0.01, to restrict the number of genes to be included. Additionally, we imposed a criterion that a gene should have an absolute log2 fold change of at least 2. This ensured that we only obtained genes with significant differential expression, which we believed to be meaningful and essential for further analysis.

2. Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And Why? How many genes passed correction?

The two primary approaches for controlling false discovery rate are Bonferroni and Benjamini-Hochberg corrections, and we employed the Benjamini-Hochberg method for multiple hypothesis testing correction. Our objective was to identify significant hits without omitting meaningful ones. Bonferroni method is useful when the number of tests is small and when the tests are independent of each other, but it becomes overly stringent and impractical when the number of tests is large or when the tests are correlated. Since we are dealing with a large dataset with over 80 samples, Bonferroni is not desirable for our purpose. Therefore, we opted for Benjamini-Hochberg correction, which provided a more comprehensive set of genes for downstream analysis. Following this correction, we found that the tumor vs. normal tissue analysis yielded 3893 genes, and the metastases vs. normal tissue analysis yielded 5477 genes that passed the correction.

3. Show the amount of differentially expressed genes using an MA Plot or a Volcano plot. Highlight genes of interest.

Volcano plots and heatmaps for both Tumor vs. Normal and Metastases vs. Normal datasets are shown above. The most significantly differentially expressed genes are labeled out by their HUGO symbols in the figures.

4. Visualize your top hits using a heatmap. Do your conditions cluster together? Explain why or why not.

There are significant clusterings within conditions. This suggests that the tumor tissues and metastases tissues do have genes that are highly differentially expressed compared to normal tissues.

Thresholded Overrepresentation Analysis using g:Profiler

For the final part of this assignment, we will perform a thresholded overrepresentation analysis using g:Profiler. In the previous section, we have compiled a list of differentially expressed genes. Here, we want to further divide them into upregulated and downregulated genes.

Tumor vs. Normal Tissue

upregulated_tn <- qlf_tn_hits[qlf_tn_hits$table$logFC > 0 & qlf_tn_hits$table$PValue < 0.01, ]
downregulated_tn <- qlf_tn_hits[qlf_tn_hits$table$logFC < 0 & qlf_tn_hits$table$PValue < 0.01, ]

We use the R package for g:Profiler to perform the gene enrichment analysis. For correction, we used FDR as it is less stringent than Bonferroni and is introduced as the preferred correction method in class. We used GO Biological Process, GO Molecular Function, and WP as those are the ones used by the author of the original publication. For a more detailed overview of the workflow, please refer to the Discussion subsection located at the end of this section.

Upregulated Genes

tn_up_top_terms_all <- gprofiler2::gost(query = rownames(upregulated_tn), 
                                  organism = "hsapiens", 
                                  exclude_iea = TRUE,
                                  correction_method = "fdr",
                                  sources = c("GO:BP", "REAC", "WP"))

tn_up_top_terms <- data.frame(
  term_name = tn_up_top_terms_all$result$term_name[tn_up_top_terms_all$result$term_size < 500 &
                                               tn_up_top_terms_all$result$term_size > 1],
  term_id = tn_up_top_terms_all$result$term_id[tn_up_top_terms_all$result$term_size < 500 &
                                           tn_up_top_terms_all$result$term_size > 1],
  source = tn_up_top_terms_all$result$source[tn_up_top_terms_all$result$term_size < 500 &
                                         tn_up_top_terms_all$result$term_size > 1]
)

knitr::kable(head(tn_up_top_terms, 10), format = "html")
term_name term_id source
external encapsulating structure organization GO:0045229 GO:BP
extracellular matrix organization GO:0030198 GO:BP
extracellular structure organization GO:0043062 GO:BP
cell projection morphogenesis GO:0048858 GO:BP
cell part morphogenesis GO:0032990 GO:BP
plasma membrane bounded cell projection morphogenesis GO:0120039 GO:BP
response to wounding GO:0009611 GO:BP
neuron projection morphogenesis GO:0048812 GO:BP
cell junction assembly GO:0034329 GO:BP
chemotaxis GO:0006935 GO:BP

For context, let’s examine the top term from each data source.

knitr::kable(rbind(tn_up_top_terms[tn_up_top_terms$source == "GO:BP",][1,],
                   tn_up_top_terms[tn_up_top_terms$source == "REAC",][1,],
                   tn_up_top_terms[tn_up_top_terms$source == "WP",][1,]),
             format = "html")
term_name term_id source
1 external encapsulating structure organization GO:0045229 GO:BP
416 Extracellular matrix organization REAC:R-HSA-1474244 REAC
453 Malignant pleural mesothelioma WP:WP5087 WP

Generate a Manhattan plot to visualize the distribution of the top terms from each data source.

gprofiler2::gostplot(tn_up_top_terms_all) %>% plotly::layout(title = "Manhattan plot for Upregulated genes (Tumor vs. Normal)", font = list(size = 10))

Number of terms:

length(tn_up_top_terms$term_name)
## [1] 472

Downregulated Genes

We do the same for the downregualted genes.

tn_down_top_terms_all <- gprofiler2::gost(query = rownames(downregulated_tn), 
                                  organism = "hsapiens", 
                                  exclude_iea = TRUE,
                                  correction_method = "fdr",
                                  sources = c("GO:BP", "REAC", "WP"))

tn_down_top_terms <- data.frame(
  term_name = tn_down_top_terms_all$result$term_name[tn_down_top_terms_all$result$term_size < 500 &
                                                 tn_down_top_terms_all$result$term_size > 1],
  term_id = tn_down_top_terms_all$result$term_id[tn_down_top_terms_all$result$term_size < 500 &
                                             tn_down_top_terms_all$result$term_size > 1],
  source = tn_down_top_terms_all$result$source[tn_down_top_terms_all$result$term_size < 500 &
                                           tn_down_top_terms_all$result$term_size > 1]
)

knitr::kable(head(tn_down_top_terms, 10),format = "html")
term_name term_id source
aerobic respiration GO:0009060 GO:BP
cellular respiration GO:0045333 GO:BP
oxidative phosphorylation GO:0006119 GO:BP
electron transport chain GO:0022900 GO:BP
proton motive force-driven mitochondrial ATP synthesis GO:0042776 GO:BP
generation of precursor metabolites and energy GO:0006091 GO:BP
proton motive force-driven ATP synthesis GO:0015986 GO:BP
energy derivation by oxidation of organic compounds GO:0015980 GO:BP
respiratory electron transport chain GO:0022904 GO:BP
ATP synthesis coupled electron transport GO:0042773 GO:BP
knitr::kable(rbind(tn_down_top_terms[tn_down_top_terms$source == "GO:BP",][1,],
                   tn_down_top_terms[tn_down_top_terms$source == "REAC",][1,],
                   tn_down_top_terms[tn_down_top_terms$source == "WP",][1,]),
             format = "html")
term_name term_id source
1 aerobic respiration GO:0009060 GO:BP
97 The citric acid (TCA) cycle and respiratory electron transport REAC:R-HSA-1428517 REAC
116 Electron transport chain: OXPHOS system in mitochondria WP:WP111 WP

Plot the Manhattan plot to visualize distribution of terms from each data source for downregulated genes.

gprofiler2::gostplot(tn_down_top_terms_all) %>% 
  plotly::layout(title = "Manhattan plot for Downregulated genes (Tumor vs. Normal)", font = list(size = 10))

Number of terms

length(tn_down_top_terms$term_name)
## [1] 126

All Differentially Expressed Genes

Finally, for all differentially expressed genes.

tn_top_terms_all <- gprofiler2::gost(query = rownames(qlf_tn_hits), 
                                  organism = "hsapiens", 
                                  exclude_iea = TRUE,
                                  correction_method = "fdr",
                                  sources = c("GO:BP", "REAC", "WP"))

tn_top_terms <- data.frame(
  term_name = tn_top_terms_all$result$term_name[tn_top_terms_all$result$term_size < 500 &
                                            tn_top_terms_all$result$term_size > 1],
  term_id = tn_top_terms_all$result$term_id[tn_top_terms_all$result$term_size < 500 &
                                        tn_top_terms_all$result$term_size > 1],
  source = tn_top_terms_all$result$source[tn_top_terms_all$result$term_size < 500 &
                                      tn_top_terms_all$result$term_size > 1]
)

knitr::kable(head(tn_top_terms, 10), format = "html")
term_name term_id source
positive regulation of cell migration GO:0030335 GO:BP
ribonucleoprotein complex biogenesis GO:0022613 GO:BP
autophagy GO:0006914 GO:BP
process utilizing autophagic mechanism GO:0061919 GO:BP
positive regulation of cell motility GO:2000147 GO:BP
positive regulation of locomotion GO:0040017 GO:BP
ribosome biogenesis GO:0042254 GO:BP
vasculature development GO:0001944 GO:BP
regulation of amide metabolic process GO:0034248 GO:BP
regulation of mitotic cell cycle GO:0007346 GO:BP
knitr::kable(rbind(tn_top_terms[tn_top_terms$source == "GO:BP",][1,],
                   tn_top_terms[tn_top_terms$source == "REAC",][1,],
                   tn_top_terms[tn_top_terms$source == "WP",][1,]),
             format = "html")
term_name term_id source
1 positive regulation of cell migration GO:0030335 GO:BP
896 RHO GTPase cycle REAC:R-HSA-9012999 REAC
1316 VEGFA-VEGFR2 signaling pathway WP:WP3888 WP
gprofiler2::gostplot(tn_top_terms_all) %>% plotly::layout(title = "Manhattan plot for All DE genes (Tumor vs. Normal)", font = list(size = 10))

Number of terms:

length(tn_top_terms$term_name)
## [1] 1447

Metastases vs. Normal Tissue

upregulated_mn <- qlf_mn_hits[qlf_mn_hits$table$logFC > 0 & qlf_mn_hits$table$PValue < 0.01, ]
downregulated_mn <- qlf_mn_hits[qlf_mn_hits$table$logFC < 0 & qlf_mn_hits$table$PValue < 0.01, ]

We use the R package for g:Profiler to perform the gene enrichment analysis. For correction, we used FDR as it is less stringent than Bonferroni and is introduced as the preferred correction method in class. We used GO Biological Process, GO Molecular Function, and WP as those are the ones used by the author of the original publication. For a more detailed overview of the workflow, please refer to the Discussion subsection located at the end of this section.

Upregulated Genes

mn_up_top_terms_all <- gprofiler2::gost(query = rownames(upregulated_mn), 
                                  organism = "hsapiens", 
                                  exclude_iea = TRUE,
                                  correction_method = "fdr",
                                  sources = c("GO:BP", "REAC", "WP"))

mn_up_top_terms <- data.frame(
  term_name = mn_up_top_terms_all$result$term_name[mn_up_top_terms_all$result$term_size < 500 &
                                               mn_up_top_terms_all$result$term_size > 1],
  term_id = mn_up_top_terms_all$result$term_id[mn_up_top_terms_all$result$term_size < 500 &
                                           mn_up_top_terms_all$result$term_size > 1],
  source = mn_up_top_terms_all$result$source[mn_up_top_terms_all$result$term_size < 500 &
                                         mn_up_top_terms_all$result$term_size > 1]
)

knitr::kable(head(mn_up_top_terms, 10), format = "html")
term_name term_id source
T cell activation GO:0042110 GO:BP
leukocyte cell-cell adhesion GO:0007159 GO:BP
regulation of leukocyte cell-cell adhesion GO:1903037 GO:BP
regulation of T cell activation GO:0050863 GO:BP
positive regulation of leukocyte cell-cell adhesion GO:1903039 GO:BP
regulation of lymphocyte activation GO:0051249 GO:BP
regulation of cell-cell adhesion GO:0022407 GO:BP
positive regulation of cell-cell adhesion GO:0022409 GO:BP
positive regulation of T cell activation GO:0050870 GO:BP
lymphocyte proliferation GO:0046651 GO:BP

For context, let’s examine the top term from each data source.

knitr::kable(rbind(mn_up_top_terms[mn_up_top_terms$source == "GO:BP",][1,],
                   mn_up_top_terms[mn_up_top_terms$source == "REAC",][1,],
                   mn_up_top_terms[mn_up_top_terms$source == "WP",][1,]),
             format = "html")
term_name term_id source
1 T cell activation GO:0042110 GO:BP
936 Viral mRNA Translation REAC:R-HSA-192823 REAC
1079 TYROBP causal network in microglia WP:WP3945 WP

We can visualize the distribution of top terms from each data source using an Manhattan plot.

gprofiler2::gostplot(mn_up_top_terms_all) %>% plotly::layout(title = "Manhattan plot for upregulated genes (Metastases vs. Normal)", font = list(size = 10))

Number of terms:

length(mn_up_top_terms$term_name)
## [1] 1168

Downregulated Genes

We do the same for the downregualted genes.

mn_down_top_terms_all <- gprofiler2::gost(query = rownames(downregulated_mn), 
                                  organism = "hsapiens", 
                                  exclude_iea = TRUE,
                                  correction_method = "fdr",
                                  sources = c("GO:BP", "REAC", "WP"))

mn_down_top_terms <- data.frame(
  term_name = mn_down_top_terms_all$result$term_name[mn_down_top_terms_all$result$term_size < 500 &
                                                 mn_down_top_terms_all$result$term_size > 1],
  term_id = mn_down_top_terms_all$result$term_id[mn_down_top_terms_all$result$term_size < 500 &
                                             mn_down_top_terms_all$result$term_size > 1],
  source = mn_down_top_terms_all$result$source[mn_down_top_terms_all$result$term_size < 500 &
                                           mn_down_top_terms_all$result$term_size > 1]
)

knitr::kable(head(mn_down_top_terms, 10), format = "html")
term_name term_id source
cilium organization GO:0044782 GO:BP
aerobic respiration GO:0009060 GO:BP
cellular respiration GO:0045333 GO:BP
cilium assembly GO:0060271 GO:BP
generation of precursor metabolites and energy GO:0006091 GO:BP
oxidative phosphorylation GO:0006119 GO:BP
plasma membrane bounded cell projection assembly GO:0120031 GO:BP
cell projection assembly GO:0030031 GO:BP
energy derivation by oxidation of organic compounds GO:0015980 GO:BP
electron transport chain GO:0022900 GO:BP
knitr::kable(rbind(mn_down_top_terms[mn_down_top_terms$source == "GO:BP",][1,],
                   mn_down_top_terms[mn_down_top_terms$source == "REAC",][1,],
                   mn_down_top_terms[mn_down_top_terms$source == "WP",][1,]),
             format = "html")
term_name term_id source
1 cilium organization GO:0044782 GO:BP
116 The citric acid (TCA) cycle and respiratory electron transport REAC:R-HSA-1428517 REAC
132 Electron transport chain: OXPHOS system in mitochondria WP:WP111 WP

Plot the Manhattan plot showing distribution of terms from each data source using list of downregulated genes.

gprofiler2::gostplot(mn_down_top_terms_all) %>% 
  plotly::layout(title = "Manhattan plot for downregulated genes (Metastases vs. Normal)", font = list(size = 10))
length(mn_down_top_terms$term_name)
## [1] 149

All Differentially Expressed Genes

Finally, for all differentially expressed genes.

mn_top_terms_all <- gprofiler2::gost(query = rownames(downregulated_mn), 
                                  organism = "hsapiens", 
                                  exclude_iea = TRUE,
                                  correction_method = "fdr",
                                  sources = c("GO:BP", "REAC", "WP"))

mn_top_terms <- data.frame(
  term_name = mn_top_terms_all$result$term_name[mn_top_terms_all$result$term_size < 500 &
                                            mn_top_terms_all$result$term_size > 1],
  term_id = mn_top_terms_all$result$term_id[mn_top_terms_all$result$term_size < 500 &
                                        mn_top_terms_all$result$term_size > 1],
  source = mn_top_terms_all$result$source[mn_top_terms_all$result$term_size < 500 &
                                      mn_top_terms_all$result$term_size > 1]
)

knitr::kable(head(mn_top_terms, 10), format = "html")
term_name term_id source
cilium organization GO:0044782 GO:BP
aerobic respiration GO:0009060 GO:BP
cellular respiration GO:0045333 GO:BP
cilium assembly GO:0060271 GO:BP
generation of precursor metabolites and energy GO:0006091 GO:BP
oxidative phosphorylation GO:0006119 GO:BP
plasma membrane bounded cell projection assembly GO:0120031 GO:BP
cell projection assembly GO:0030031 GO:BP
energy derivation by oxidation of organic compounds GO:0015980 GO:BP
electron transport chain GO:0022900 GO:BP

Top terms from each data source for all differentially expressed genes (Metastases vs. Normal)

knitr::kable(rbind(mn_top_terms[mn_top_terms$source == "GO:BP",][1,],
                   mn_top_terms[mn_top_terms$source == "REAC",][1,],
                   mn_top_terms[mn_top_terms$source == "WP",][1,]),
             format = "html")
term_name term_id source
1 cilium organization GO:0044782 GO:BP
116 The citric acid (TCA) cycle and respiratory electron transport REAC:R-HSA-1428517 REAC
132 Electron transport chain: OXPHOS system in mitochondria WP:WP111 WP
gprofiler2::gostplot(mn_top_terms_all) %>% 
  plotly::layout(title = "Manhattan plot for All DE genes 
                 (Metastases vs. Normal)", font = list(size = 10))

Number of terms:

length(mn_top_terms$term_name)
## [1] 149

Thresholded Over-representation analysis Discussion

1. Which method did you choose and why?

I chose g:Profiler because it was discussed in lectures, and previous Journal Entry assignment has introduced the g:Profiler to us with its usage. It provides several analysis methods and visualizations for genomic data, which meets our need. Moreover, I had previous experience using its web-interface, and after learning that it also has R package utility, this is a good chance to try it out.

2. What annotation data did you use and why? What version of the annotation are you using?

I choose GO:BP, Reactome, and WikiPathways for annotation because they were previously mentioned in the Journal Assignment, which were also used on human genes, and these three are very comprehensive datasets for human pathways. The version I am using is as follows: - GO:BP releases/2022-12-04 - REAC releases/2022-12-28 - WP releases/20221210

3. How many genesets were returned with what thresholds?

For all three analysis (using upregulated, downregulated, all differentially expressed) for both tumor vs. normal tissue and metastases vs. normal tissue, we used a threshold between 1 and 500. We set the upper bound to 500 because we do not want to include overly broad and generic terms that will not give us meaningful insights into the roles of the differentially expressed genes. The P value threshold is set to 0.01.

Tumor vs. Normal:
Upregulated genes returned 490 gene sets; Downregulated genes returned 154 genes ets; All differentially expressed genes returned 1151 gene sets.

Metastases vs. Normal:
Upregulated genes returned 1180 gene sets; Downregulated genes returned 162 gene sets; All differentially expressed genes returned 162 gene sets.

4. Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?

Tumor vs. Normal:
Taking all DE genes together returned more gene sets than the results for upregulated and downregulated separately, and the results for upregulated genes are about three times more than the downregulated genes. This suggests that the upregulated genes may be more strongly associated with specific biological processes or pathways than the downregulated genes. Similarly, the fact that the combined set of DE genes returned more gene sets than either the upregulated or downregulated genes alone suggests that there may be some shared biological processes or pathways that are affected by both upregulated and downregulated genes.

Metastases vs. Normal:
Upregulated genes returned around 10 times more terms than the result for all DE genes as a whole. This indicates that the upregulated genes are enriched for certain biological functions or pathways more strongly than the overall set of differentially expressed genes.

Interpretation

1. Do the over-representation results support conclusions or mechanism discussed in the original paper?

Yes, the over-representation results support the conclusions and mechanism discussed in the original paper. Terms such as regulation of gene expression showed up within the analysis result for up-regulated gene sets for tumor tissues and metastases tissues. In the original paper, the authors predicted that the tumors would have TFs that interact with the MAPK pathway and regulate gene expression in a way that is relevant to the development and progression of cancer, and indeed they found that TFs in the MAPK pathway are actively bound significantly more in tumor and metastases than in normal tissue. Transcription factors are widely know to play a role in regulating gene expresison, which indicates that our results support the conclusions discussed in the original paper.

2. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.

There are thousands of publications describing how transcription factors regulate gene expressions. In the review paper “Transcription factors and evolution: An integral part of gene expression (Review)” by Thanasis et. al, they discussed the function of TFs as the primary regulators of gene expression (Mitsis et al. (2020)). In Wikipedia, it is also stated that the function of TFs is to regulate gene expression. Thus, we have sufficient support on our results.

References

Chen, Y., A. T. L. Lun, and G. K. Smyth. 2016. “From Reads to Genes to Pathways: Differential Expression Analysis of RNA-Seq Experiments Using Rsubread and the edgeR Quasi-Likelihood Pipeline.”
Davis, S., and P. Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics 14: 1846–47.
Durinck, S., Y. Moreau, A. Kasprzyk, S. Davis, B. Moor, A. Brazma, and W. Huber. 2005. “BioMart and Bioconductor: A Powerful Link Between Biological Databases and Microarray Data Analysis.” Bioinformatics 21: 3439–40.
Durinck, S., P. Spellman, E. Birney, and W. Huber. 2009. “Mapping Identifiers for the Integration of Genomic Datasets with the r/Bioconductor Package biomaRt.” Nature Protocols 4: 1184–91.
Gu, Z., R. Eils, and M. Schlesner. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” Bioinformatics. https://doi.org/10.1093/bioinformatics/btw313.
Gu, Z., L. Gu, R. Eils, M. Schlesner, and B. Brors. 2014. “Circlize Implements and Enhances Circular Visualization in r.” Bioinformatics 30: 2811–12.
Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12 (2): 115–21.
Kolberg, L., U. Raudvere, I. Kuzmin, J. Vilo, and H. Peterson. 2020. “Gprofiler2– an r Package for Gene List Functional Enrichment Analysis and Namespace Conversion Toolset g:profiler.”
McCarthy, D. J., Chen Y, and G. K. Smyth. 2012. “Differential Expression Analysis of Multifactor RNA-Seq Experiments with Respect to Biological Variation.” Nucleic Acids Research 40: 4288–97.
Mitsis, T., A. Efthimiadou, F. Bacopoulou, D. Vlachakis, G. P. Chrousos, and E. Eliopoulos. 2020. “Transcription Factors and Evolution: An Integral Part of Gene Expression (Review.” World Academy of Sciences Journal 2: 3–8. https://doi.org/10.3892/wasj.2020.32.
Morgan, Martin. 2021. “BiocManager: Access the Bioconductor Project Package Repository.”
R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Robinson, M. D., McCarthy DJ, and G. K. Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26: 139–40.